2D similarity kernels and representations for sequence data

Author

  • Pavel P. Kuksa
Abstract

Analysis of large-scale sequential data has become an important task in machine learning and pattern recognition, motivated in part by numerous scientific and technological applications such as document and text classification and the analysis of music data or biological sequences. In this work, we consider general, simple 2D matrix representations of sequences and introduce a class of 2D similarity kernels that allows efficient inexact matching, comparison, and classification of inputs given as sequences of R-dimensional feature vectors. The developed approach is applicable to a wide range of sequence domains, both discrete- and continuous-valued, such as music, images, or biological sequences. Experiments using the new 2D representations and kernels on music genre and artist recognition show excellent predictive performance with significant 25%-40% improvements over the existing state-of-the-art sequence classification methods.

Background. A number of state-of-the-art approaches to classification of sequences over a finite alphabet Σ rely on measuring sequence similarity using fixed-length representations Φ(X) of sequences as the spectra (|Σ|^k-dimensional histograms) of counts of short substrings (k-mers) contained, possibly with up to m mismatches, in a sequence; cf. the spectrum/mismatch methods [3, 4]. This essentially amounts to analysis of 1D sequences over finite alphabets Σ with 1D k-mers as basic sequence features. However, original input sequences are often in the form of sequences of feature vectors, i.e., each input sequence X is a sequence of R-dimensional feature vectors, which can be considered an R × |X| feature matrix. Examples of these include
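The 1D k-mer spectrum representation mentioned in the Background can be illustrated with a minimal Python sketch. It covers only the exact-match spectrum kernel (no mismatches) and is not the 2D kernel proposed in the paper; the function names `spectrum` and `spectrum_kernel` are hypothetical and chosen for illustration.

```python
from collections import Counter


def spectrum(seq, k):
    """Return the k-mer spectrum of seq: counts of every length-k substring."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k):
    """Exact-match spectrum kernel: inner product of the two k-mer count vectors.

    Illustrative sketch only; it does not handle mismatches and is not
    the 2D similarity kernel introduced in the paper.
    """
    sx, sy = spectrum(x, k), spectrum(y, k)
    # Only k-mers present in x can contribute; Counter returns 0 for k-mers absent from y.
    return sum(count * sy[kmer] for kmer, count in sx.items())


if __name__ == "__main__":
    # "ABAB" has 2-mers AB (x2), BA (x1); "BABA" has BA (x2), AB (x1).
    print(spectrum_kernel("ABAB", "BABA", 2))  # 2*1 + 1*2 = 4
```

Storing the spectrum as a sparse counter avoids materializing the full |Σ|^k-dimensional histogram, most of whose entries are zero for short sequences.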


Similar resources

Algebraic Set Kernels with Application to Inference Over Local Image Representations

This paper presents a general family of algebraic positive definite similarity functions over spaces of matrices with varying column rank. The columns can represent local regions in an image (whereby images have varying number of local parts), images of an image sequence, motion trajectories in a multibody motion, and so forth. The family of set kernels we derive is based on a group invariant t...


Graph Kernels versus Graph Representations: a Case Study in Parse Ranking

Recently, several kernel functions designed for data that consists of graphs have been presented. In this paper, we concentrate on designing graph representations and adapting the kernels for these graphs. In particular, we propose graph representations for dependency parses and analyse the applicability of several variations of the graph kernels for the problem of parse ranking in the domain...


Efficient multivariate kernels for sequence classification

Kernel-based approaches for sequence classification have been successfully applied to a variety of domains, including text categorization, image classification, speech analysis, biological sequence analysis, time series and music classification, where they show some of the most accurate results. Typical kernel functions for sequences in these domains (e.g., bag-of-words, mismatch, or subseq...


Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity

MOTIVATION Small molecules play a fundamental role in organic chemistry and biology. They can be used to probe biological systems and to discover new drugs and other useful compounds. As increasing numbers of large datasets of small molecules become available, it is necessary to develop computational methods that can deal with molecules of variable size and structure and predict their physical,...


Generalized Similarity Kernels for Efficient Sequence Classification

String kernel-based machine learning methods have yielded great success in practical tasks of structured/sequential data analysis. In this paper we propose a novel computational framework that uses general similarity metrics and distance-preserving embeddings with string kernels to improve sequence classification. An embedding step, a distance-preserving bitstring mapping, is used to effectivel...




Publication date: 2012